CHAPTER 1



Introducing Titan 



Graph theory dates back to 1736, when Leonhard Euler published a paper on the mathematical
problem of the Seven Bridges of Königsberg. Until the dawn of computer science, graph theory
remained an active area of research in mathematics. Over the past couple of decades, however,
it has been widely applied in computer science. In today's age of big data, graph theory is
used to solve a range of practical problems. Some use cases of graph technology are:

• Google's PageRank algorithm, which revolutionised the search experience on the World
Wide Web, is a link analysis algorithm based on a graph.

• Fraud and crime detection. 

• Social network analysis, such as identifying communities of interest. Google's Knowledge
Graph is a good example.

• Recommendation systems that identify products sold together and customers with
similar behaviour.

• Representing and storing gene and protein structures.

What is Titan? 

Titan is a distributed, scalable graph database. A graph is a data structure composed of
vertices and edges. Titan is optimized for storing and querying graphs containing hundreds
of billions of vertices and edges distributed across a multi-machine cluster. It is a
transactional database that can support thousands of concurrent users executing complex
graph traversals.
Figure 1-1 Graph data structure
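
To make the vertex and edge terminology concrete, the following is a minimal in-memory
sketch of a property graph in plain Java. It is an illustration only and is not Titan's
API; the class and method names (PropertyGraphSketch, addVertex, addEdge, neighbours) are
hypothetical and chosen for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a tiny in-memory property graph, not Titan's API.
class PropertyGraphSketch {

    // A vertex has an id, key/value properties, and a list of outgoing edges.
    static final class Vertex {
        final long id;
        final Map<String, Object> properties = new HashMap<>();
        final List<Edge> outEdges = new ArrayList<>();

        Vertex(long id) {
            this.id = id;
        }
    }

    // An edge is directed, labelled, and connects two vertices.
    static final class Edge {
        final Vertex out;   // source vertex
        final Vertex in;    // target vertex
        final String label;

        Edge(Vertex out, Vertex in, String label) {
            this.out = out;
            this.in = in;
            this.label = label;
        }
    }

    private final Map<Long, Vertex> vertices = new HashMap<>();
    private long nextId = 1;

    Vertex addVertex(String name) {
        Vertex v = new Vertex(nextId++);
        v.properties.put("name", name);
        vertices.put(v.id, v);
        return v;
    }

    Edge addEdge(Vertex out, Vertex in, String label) {
        Edge e = new Edge(out, in, label);
        out.outEdges.add(e);
        return e;
    }

    // Names of the vertices reached from v over one outgoing edge with the given label.
    static List<String> neighbours(Vertex v, String label) {
        List<String> names = new ArrayList<>();
        for (Edge e : v.outEdges) {
            if (e.label.equals(label)) {
                names.add((String) e.in.properties.get("name"));
            }
        }
        return names;
    }

    public static void main(String[] args) {
        PropertyGraphSketch g = new PropertyGraphSketch();
        Vertex alice = g.addVertex("alice");
        Vertex bob = g.addVertex("bob");
        g.addEdge(alice, bob, "knows");
        System.out.println(neighbours(alice, "knows"));
    }
}
```

A graph database such as Titan persists and distributes this kind of structure; the sketch
keeps everything in one JVM heap purely to show how vertices, edges, labels, and properties
relate to one another.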
Titan Graph Stack



Titan is built on the TinkerPop stack. TinkerPop is an open source stack of graph technologies
that provides building blocks for developing high-performance graph applications.
The TinkerPop technology stack includes:

Blueprints 

Blueprints is a property graph model interface with provided implementations. Databases
that implement the Blueprints interfaces automatically support Blueprints-enabled
applications. It is analogous to JDBC, but for graph databases. Titan natively implements the
Blueprints interface, which means that it supports all of the open-source technologies in the
TinkerPop stack. Being a native implementation means that Titan implements the Blueprints
interface directly, without an adapter. This makes Titan one of the most efficient
Blueprints implementations, which benefits the performance of all TinkerPop projects when
running on Titan.

Pipes 

Pipes is a dataflow framework that enables the splitting, merging, filtering, and
transformation of data from input to output. Computations are evaluated in a
memory-efficient, lazy fashion.
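
The idea of lazy, step-by-step evaluation can be sketched with the standard java.util.stream
API. This is only an analogy: it does not use the Pipes library, and the class name
LazyPipelineSketch and the sample data are hypothetical. The counter shows that elements are
pulled through the filter and transformation steps only as results are demanded.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;

// Illustrative only: java.util.stream stands in for a Pipes-style pipeline.
class LazyPipelineSketch {

    // Counts how many input elements were actually pulled into the pipeline.
    static final AtomicInteger evaluated = new AtomicInteger();

    static List<String> firstTwoLongNames(List<String> names) {
        evaluated.set(0);
        return names.stream()
                .peek(n -> evaluated.incrementAndGet())   // observe each element pulled
                .filter(n -> n.length() > 3)              // filtering step
                .map(String::toUpperCase)                 // transformation step
                .limit(2)                                 // stop once two results exist
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> result = firstTwoLongNames(
                List.of("zeus", "hera", "ares", "pluto", "jupiter"));
        System.out.println(result + " after pulling " + evaluated.get() + " of 5 elements");
    }
}
```

Because the pipeline stops as soon as two results are available, the remaining inputs are
never read, which is the property that keeps memory use low on large inputs.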

Gremlin 

Gremlin is a domain-specific language for traversing property graphs. It is used for graph
query, analysis, and manipulation.

Frames 

Frames is an object-to-graph mapper for rendering Java objects from graph data. Instead of 
writing software in terms of vertices and edges, with Frames, software is written in terms of 
domain objects and their relationships to each other. 

Furnace

Furnace is a property graph algorithms package. It provides implementations of standard
graph analysis algorithms that can be applied to property graphs in meaningful ways.

Rexster 

Rexster is a multi-faceted graph server that exposes any Blueprints graph through several 
mechanisms with a general focus on REST. 
Storage backend



Titan supports multiple storage backends for persisting data. Depending on the storage
backend, Titan can accommodate different levels of isolation, consistency, scalability, and
availability.

However, the CAP theorem stipulates that a database can provide only two of the three
desirable properties: Consistency, Availability, and Partitionability (partition tolerance,
which distributed, scalable deployments depend on). The choice of storage backend is
therefore guided by the requirements of a particular use case.

Titan currently supports four storage backends, which are mapped onto the CAP triangle in
Figure 1-2:

• Cassandra 

• Apache HBase 

• Oracle Berkeley DB 

• Persistit 



Figure 1-2 Titan storage backends mapped on the CAP triangle



Storage backend   Consistency            Availability              Scalability          Replication
---------------   --------------------   -----------------------   ------------------   -----------
Cassandra         eventual consistency   highly available          linear scalability   Yes
HBase             vertex consistency     Single point of failure   linear scalability   Yes
BerkeleyDB        ACID                   Single point of failure   single machine       In HA mode
Persistit         ACID                   Single point of failure   single machine       No

Table 1.1 Comparison of storage backends
Indexing backend

Titan provides two types of index systems: the standard index, and the external index
interface, which supports an arbitrary number of external indexing backends to provide
support for geo, numeric range, and full-text search.

The standard index is very fast and always available without any further configuration, but 
only supports exact index matches. In other words, the standard index can only retrieve 
vertices and edges by matching one of their properties exactly. 

The external index interface is more flexible and supports retrieving vertices and edges by
bounding their geo-location, by properties that fall into a numeric range, or by matching
tokens in full text. The external index interface connects to separate systems as index
backends for indexing and retrieval, much as storage backends are used for persistence.
Index backends need to be configured in the graph configuration before they can be used.

The choice of index backend determines which search features are supported, as well as the 
performance and scalability of the index. 
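
The difference between an exact-match index and a range-capable index can be sketched with
two standard Java collections. This is an illustration of the two lookup styles only, not
of how Titan or its index backends are implemented; the class IndexSketch, its methods, and
the sample property names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: contrasts exact-match and range lookups; not Titan's index API.
class IndexSketch {

    // Exact-match index: property value -> vertex ids (hash lookup).
    private final Map<String, List<Long>> byName = new HashMap<>();

    // Range-capable index: sorted property value -> vertex ids.
    private final TreeMap<Integer, List<Long>> byAge = new TreeMap<>();

    void index(long vertexId, String name, int age) {
        byName.computeIfAbsent(name, k -> new ArrayList<>()).add(vertexId);
        byAge.computeIfAbsent(age, k -> new ArrayList<>()).add(vertexId);
    }

    // The kind of question an exact-match index can answer: name == value.
    List<Long> findByExactName(String name) {
        return byName.getOrDefault(name, List.of());
    }

    // The kind of question that needs a range-capable index: low <= age <= high.
    List<Long> findByAgeBetween(int low, int high) {
        List<Long> ids = new ArrayList<>();
        for (List<Long> bucket : byAge.subMap(low, true, high, true).values()) {
            ids.addAll(bucket);
        }
        return ids;
    }

    public static void main(String[] args) {
        IndexSketch idx = new IndexSketch();
        idx.index(1L, "hercules", 30);
        idx.index(2L, "jupiter", 5000);
        idx.index(3L, "pluto", 4000);
        System.out.println(idx.findByExactName("pluto"));
        System.out.println(idx.findByAgeBetween(1000, 4500));
    }
}
```

A hash-based structure answers exact matches quickly but cannot answer "between" queries
without scanning, which is why range, geo, and full-text search are delegated to an
external index backend.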

Titan currently supports two index backends: 

• Elasticsearch 

• Lucene 

Elasticsearch 

Titan supports Elasticsearch as an embedded or remote index backend. In embedded mode, 
Elasticsearch runs in the same JVM as Titan and stores data on the local machine. In remote 
mode, Titan connects to a running Elasticsearch cluster as a client. 

Lucene 

Titan supports Lucene as a single-machine, embedded index backend. Lucene has a slightly 
extended feature set and performs better in small-scale applications compared to 
Elasticsearch, but is limited to single-machine deployments. 
CHAPTER 2



Installation & Setup with Cassandra 

Cassandra is a massively scalable open source NoSQL column store. Cassandra can store
large amounts of structured, semi-structured, and unstructured data across multiple data
centres and the cloud. Using Cassandra as the Titan storage backend has the following
advantages:

• High availability with no single point of failure. 

• No read/write bottlenecks to the graph as there is no master/slave architecture. 

• Elastic scalability allows for the introduction and removal of machines. 

• Caching layer ensures that continuously accessed data is available in memory. 

• The size of the cache can be increased by adding more machines to the cluster.

• Integration with Hadoop. 

Titan can be configured with Cassandra as a storage backend in several different ways.



Local Server Mode 

In this mode, Titan, Cassandra, and the end-user application are installed on the same node.
Titan and Cassandra communicate with one another via a localhost socket.
Figure 2-1 Cassandra local server mode

Running Titan using local Cassandra requires the following setup steps:

1. Download the minimal Titan/Cassandra distribution.



rizwan@ubuntu:~$ wget http://s3.thinkaurelius.com/downloads/titan/titan-cassandra-0.4.4.zip



2. Unzip the downloaded zip file.



rizwan@ubuntu:~$ unzip titan-cassandra-0.4.4.zip



3. Start Cassandra. 



rizwan@ubuntu:~$ ./titan-cassandra-0.4.4/bin/cassandra
4. Create a Titan graph backed by Cassandra using the Gremlin shell to test that your setup
is working correctly.



rizwan@ubuntu:~$ titan-cassandra-0.4.4/bin/gremlin.sh

         \,,,/
         (o o)
-----oOOo-(_)-oOOo-----
gremlin> configuration = new BaseConfiguration()
==>org.apache.commons.configuration.BaseConfiguration@7e390a9d
gremlin> configuration.setProperty("storage.backend", "cassandra")
==>null
gremlin> configuration.setProperty("storage.hostname", "127.0.0.1")
==>null
gremlin> graph = TitanFactory.open(configuration)
==>titangraph[cassandra:127.0.0.1]



Remote Server Mode 

When your data grows beyond what a single machine can hold, Titan and Cassandra can be
separated onto different machines. In this model, any number of Titan instances maintain
socket-based read/write connections to the Cassandra cluster. The end-user application
interacts with Titan directly, within the same JVM as Titan.
Figure 2-2 Cassandra remote server mode

The steps to set up Titan with Cassandra in remote server mode are as follows:

1. Set up a Cassandra cluster by following the installation instructions from DataStax.

2. Add the Titan jar to the project. If you are using Maven, add the following
dependency to pom.xml.



<dependency>
    <groupId>com.thinkaurelius.titan</groupId>
    <artifactId>titan-core</artifactId>
    <version>0.4.4</version>
</dependency>



3. Connect Titan to the Cassandra cluster. Suppose we have a running Cassandra cluster of
two nodes with IP addresses 192.168.0.1 and 192.168.0.2. Titan is then connected to the
cluster as follows:



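import org.apache.commons.configuration.BaseConfiguration;
import org.apache.commons.configuration.Configuration;
import com.thinkaurelius.titan.core.TitanFactory;
import com.thinkaurelius.titan.core.TitanGraph;

// Both cluster nodes are listed as a comma-separated value of storage.hostname.
Configuration configuration = new BaseConfiguration();
configuration.setProperty("storage.backend", "cassandra");
configuration.setProperty("storage.hostname", "192.168.0.1,192.168.0.2");
TitanGraph graph = TitanFactory.open(configuration);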
Remote Server Mode with Rexster

Rexster is a graph server that exposes any Blueprints graph through REST and a binary
protocol called RexPro. The HTTP web service provides standard low-level GET, POST, PUT,
and DELETE methods; a flexible extensions model that allows plug-in-like development for
external services (such as ad hoc graph queries through Gremlin); server-side "stored
procedures" written in Gremlin; and a browser-based interface called The Dog House.
Rexster Console makes it possible to evaluate scripts remotely against graphs configured
inside a Rexster Server.




Figure 2-3 Cassandra remote server with Rexster 



Rexster can be wrapped around each Titan instance. This type of setup is useful for polyglot
architectures: the end-user application need not be Java-based, because it can
communicate with Rexster over REST.
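
As a rough sketch of what talking to Rexster over REST looks like, the following Java 11
snippet builds (but does not send) an HTTP GET request for a single vertex. The port 8182,
the graph name "graph", the vertex id, and the /graphs/<name>/vertices/<id> path layout are
assumptions made for this example and should be checked against your Rexster configuration;
the class name RexsterRequestSketch is hypothetical.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Illustrative only: builds (but does not send) a REST request to Rexster.
class RexsterRequestSketch {

    // Host, port, graph name, and URL layout are assumptions; adjust to your server.
    static HttpRequest vertexRequest(String host, int port, String graphName, long vertexId) {
        URI uri = URI.create("http://" + host + ":" + port
                + "/graphs/" + graphName + "/vertices/" + vertexId);
        return HttpRequest.newBuilder(uri)
                .header("Accept", "application/json")
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = vertexRequest("127.0.0.1", 8182, "graph", 4);
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Since the request is plain HTTP, an equivalent call can be made from any language with an
HTTP client, which is what makes this deployment mode suitable for non-Java applications.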

Running Cassandra remote server with Rexster requires the following steps:

1. Set up a Cassandra cluster by following the installation instructions from DataStax.

2. Download Titan Server.



rizwan@ubuntu:~$ wget http://s3.thinkaurelius.com/downloads/titan/titan-server-0.4.4.zip



3. Unzip the downloaded zip file.



rizwan@ubuntu:~$ unzip titan-server-0.4.4.zip



4. Run Titan Server.



rizwan@ubuntu:~$ ./titan-server-0.4.4/bin/titan.sh start



5. Connect to Rexster using the Rexster console to test the installation.



rizwan@ubuntu:~$ ./titan-server-0.4.4/bin/rexster-console.sh

        (l_(l
(_______( 0 0
(        (-Y-) <woof>
l l-----l l
l l,,   l l,,
opening session [127.0.0.1:8184]
?h for help

rexster[groovy]>



6. Repeat steps 2 to 5 on as many nodes as the scalability requirements of the
application demand.
Titan Embedded Mode



Finally, Cassandra can be embedded in Titan, which means that Titan and Cassandra run in
the same JVM and communicate via in-process calls rather than over the network. This
removes the (de)serialization and network protocol overhead and can therefore lead to
considerable performance improvements. In this deployment mode, Titan internally starts a
Cassandra daemon; Titan no longer connects to an existing cluster but is its own cluster.

To use Titan in embedded mode, simply configure embeddedcassandra as the storage backend.
The configuration options listed below also apply to embedded Cassandra. When creating a
Titan cluster, ensure that the individual nodes can discover each other via the Gossip
protocol, so set up a Titan-with-Cassandra-embedded cluster much like you would a
standalone Cassandra cluster. When running Titan in embedded mode, the Cassandra yaml
file is configured using the additional configuration option storage.cassandra-config-dir,
which specifies the yaml file as a full URL, e.g.
storage.cassandra-config-dir = file:///home/cassandra.yaml.
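
The two settings described above can be collected into a properties-style configuration, as
in the following sketch. It uses only the Java standard library; the class name
EmbeddedConfigSketch and its helper methods are hypothetical, and the yaml location is the
example path from the text. Handing the resulting file to Titan is left as a comment, since
that step needs the Titan jars on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: assembles the embedded-mode settings named in the text.
class EmbeddedConfigSketch {

    static Map<String, String> embeddedCassandraConfig(String cassandraYamlUrl) {
        Map<String, String> settings = new LinkedHashMap<>();
        // Run Cassandra inside the Titan JVM.
        settings.put("storage.backend", "embeddedcassandra");
        // Full URL of the Cassandra yaml file.
        settings.put("storage.cassandra-config-dir", cassandraYamlUrl);
        return settings;
    }

    // Renders the settings as key=value lines, one per line.
    static String render(Map<String, String> settings) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, String> e : settings.entrySet()) {
            out.append(e.getKey()).append('=').append(e.getValue()).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Save this output to a file; an application with Titan on its classpath
        // would then typically pass that file's path to TitanFactory.open(...).
        System.out.print(render(embeddedCassandraConfig("file:///home/cassandra.yaml")));
    }
}
```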

When running a cluster with Titan and Cassandra embedded, it is advisable to expose Titan 
through the Rexster server so that applications can remotely connect to the Titan graph 
database and execute queries. 

Note that running Titan with Cassandra embedded requires GC tuning. While embedded
Cassandra can provide lower-latency query answering, its GC behaviour under load is less
predictable.
Cassandra Specific Configuration



Option                                  Description                                 Type           Default  Modifiable
--------------------------------------  ------------------------------------------  -------------  -------  ----------
storage.keyspace                        Name of the keyspace in which to store      String         titan    No
                                        the Titan specific column families.

storage.hostname                        IP address or hostname of the Cassandra     IP address or  None     Yes
                                        cluster node that this Titan instance       hostname
                                        connects to. Use a list of comma-separated
                                        hostnames or IP addresses to seed multiple
                                        Cassandra nodes.

storage.port                            Port on which to connect to Cassandra       Integer        9160     Yes
                                        cluster node.

storage.connection-timeout              Default timeout in milliseconds after       Integer        10000    Yes
                                        which to fail a connection attempt with a
                                        Cassandra node.

storage.connection-pool-size            Maximum size of the connection pool for     Integer        32       Yes
                                        connections to the Cassandra cluster.

storage.read-consistency-level          Cassandra consistency level for read                       QUORUM   Yes
                                        operations.

storage.write-consistency-level         Cassandra consistency level for write                      QUORUM   Yes
                                        operations.

storage.replication-factor              The replication factor to use. The higher   Integer        1        No
                                        the replication factor, the more robust
                                        the graph database is to machine failure
                                        at the expense of data duplication. The
                                        default value should be overwritten for
                                        production systems to ensure robustness.
                                        A value of 3 is recommended. This
                                        replication factor can only be set when
                                        the keyspace is initially created. On an
                                        existing keyspace, this value is ignored.

storage.cassandra.thrift.frame_size_mb  The maximum frame size to be used by        Integer        16       No
                                        thrift for transport. Increase this value
                                        when retrieving very large result sets.
                                        Only applicable when
                                        storage.backend=cassandrathrift.
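
The interaction between storage.replication-factor and the QUORUM consistency levels can be
worked through with a little arithmetic. The sketch below assumes the usual Cassandra
definition of a quorum as a majority of replicas (replication factor divided by two, rounded
down, plus one); the class and method names are hypothetical. It shows why the default
replication factor of 1 tolerates no replica failures, while the recommended value of 3
tolerates one.

```java
// Illustrative only: arithmetic behind replication factor and QUORUM.
class QuorumSketch {

    // Number of replicas that must respond for a QUORUM operation (a majority).
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // Replica failures a QUORUM operation can tolerate.
    static int tolerableFailures(int replicationFactor) {
        return replicationFactor - quorum(replicationFactor);
    }

    // A read overlaps the latest write on at least one replica when R + W > RF.
    static boolean readsSeeLatestWrite(int readReplicas, int writeReplicas, int replicationFactor) {
        return readReplicas + writeReplicas > replicationFactor;
    }

    public static void main(String[] args) {
        for (int rf : new int[] {1, 3, 5}) {
            int q = quorum(rf);
            System.out.println("RF=" + rf + " quorum=" + q
                    + " tolerates=" + tolerableFailures(rf)
                    + " overlapping=" + readsSeeLatestWrite(q, q, rf));
        }
    }
}
```

With QUORUM for both reads and writes, the read set and write set always share at least one
replica, so the trade-off when choosing a replication factor is mainly between fault
tolerance and the cost of storing additional copies.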